Genetic Epidemiology

Wiley

Preprints posted in the last 90 days, ranked by how well they match Genetic Epidemiology's content profile, based on 46 papers previously published here. The average preprint has a 0.04% match score for this journal, so anything above that is already an above-average fit.

1
Using Negative Control Outcomes to Detect Selection Bias in Mendelian Randomization Studies

Gkatzionis, A.; Davey Smith, G.; Tilling, K.

2026-02-01 epidemiology 10.64898/2026.01.30.26345215 medRxiv
Top 0.1%
23.0%

Mendelian randomization is currently mainly implemented through the use of genetic variants as instrumental variables to investigate the causal effect of an exposure on an outcome of interest. Mendelian randomization studies are robust to confounding bias and reverse causation, but they remain susceptible to selection bias; for example, this can happen if the exposure or outcome are associated with selection into the study sample. Negative controls are sometimes used to detect biases (typically due to confounding) in observational studies. Here, we focus specifically on Mendelian randomization analyses and discuss under what conditions a variable can be used as a negative control outcome to detect selection mechanisms that could bias Mendelian randomization estimates. We show that the main requirement is that the negative control outcome relates to confounders of the exposure and outcome. Counter-intuitively, the effect of the negative control on selection is of secondary concern; for example, a variable that does not affect selection can be a valid negative control for an outcome that does. We also investigate under what conditions age and sex can be used as negative control outcomes in Mendelian randomization analyses. In a real-data application, we investigate the pairwise causal relationships between 19 traits, utilizing data from the UK Biobank. Treating biological sex as a negative control outcome, we identify selection bias in analyses involving commonly used traits such as alcohol consumption, body mass index and educational attainment.

2
Testing for gene-environment (GxE) interaction using p-value aggregation identifies many GxE loci

Mishra, S.; Patra, R. R.; Reddy, A. S.; Mandal, A.; Majumdar, A.

2026-02-25 genomics 10.64898/2026.02.24.707798 medRxiv
Top 0.1%
14.9%

Genome-wide gene-environment (GxE) interaction studies have seen limited success in detecting reliable GxE signals. A standard genome-wide GxE scan assumes a single genetic mode of inheritance, such as an additive model. It can lead to reduced statistical power when the true genetic model is non-additive, such as a recessive model. We propose a robust GxE testing approach that uses Cauchy p-value aggregation. It combines the p-values from GxE tests based on the additive, dominant, and recessive genetic models. Using extensive simulation studies, we demonstrate that the p-value combination strategy offers a robust and powerful approach to identifying GxE interactions regardless of the underlying genetic model. The method is substantially more powerful than the additive model when the true genetic model is recessive. It is also more powerful than the general two-degree-of-freedom genotypic test for GxE interaction. We apply our approach to analyze GxE interactions in the UK Biobank data across several combinations of phenotypes and environmental factors. For glycated hemoglobin (HbA1c) level, treating cumulative smoking exposure as the lifestyle factor, our approach identified 82 independent GxE loci while controlling FDR at 5%. The GxE test based on the additive genetic model detected 24 loci. For type 2 diabetes with sleep duration as a lifestyle factor, the proposed approach detected 563 independent GxE loci at 5% FDR, substantially exceeding the number of discoveries by the other approaches.
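The Cauchy aggregation step described in this abstract can be illustrated with a minimal sketch. It implements the standard Cauchy combination rule for a set of p-values; the function name `cauchy_combine`, the equal default weights, and the example p-values are assumptions made for illustration and are not taken from the preprint.

```python
import math


def cauchy_combine(p_values, weights=None):
    """Combine p-values with the Cauchy combination rule.

    Each p-value is mapped to a standard Cauchy variate via
    tan((0.5 - p) * pi); the weighted sum is mapped back to a p-value
    with the Cauchy survival function. Equal weights are an assumption.
    """
    if not p_values:
        raise ValueError("need at least one p-value")
    if any(not (0.0 < p < 1.0) for p in p_values):
        raise ValueError("p-values must lie strictly between 0 and 1")
    if weights is None:
        weights = [1.0] * len(p_values)
    total = sum(weights)
    weights = [w / total for w in weights]  # normalise so weights sum to 1
    t = sum(w * math.tan((0.5 - p) * math.pi) for w, p in zip(weights, p_values))
    return 0.5 - math.atan(t) / math.pi


# Hypothetical p-values from additive, dominant, and recessive GxE tests.
p_additive, p_dominant, p_recessive = 0.30, 0.60, 1e-8
combined = cauchy_combine([p_additive, p_dominant, p_recessive])
```

With equal weights the combined p-value is driven by the smallest input, which is why the aggregated test can stay powerful when only one of the three genetic models fits the data.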

3
Mediation analysis in longitudinal data: an unbiased estimator for cumulative indirect effect

Li, Y.; Cabral, H.; Tripodis, Y.; Ma, J.; Levy, D.; Joehanes, R.; Liu, C.; Lee, J.

2026-04-20 epidemiology 10.64898/2026.04.18.26351189 medRxiv
Top 0.1%
14.5%

Mediation analysis quantifies how an exposure affects an outcome through an intermediate variable. We extend mediation analysis to capture the cumulative effects of longitudinal predictors on longitudinal outcomes. Our proposed model examines how mediators transmit the effects of the current and previous exposure on the current outcome. We construct a least-squares estimator for the cumulative indirect effect (CIE) and use three approaches (exact form, delta method, and bootstrap procedure) to estimate its standard error (SE). The estimator of the CIE is unbiased when there is no unmeasured confounding and the errors of the mediator model and the outcome model are independent at all time points, as shown analytically and in simulations. While the three SE estimates are numerically similar, the bootstrap procedure is recommended because it is simple to implement. We apply this method to the Framingham Heart Study Offspring cohort to assess whether DNA methylation mediates the association of alcohol consumption with systolic blood pressure over two time points. We identify two CpGs (cg05130679 and cg05465916) as mediators and construct a composite DNA methylation score from 11 CpGs, which mediates 39% of the cumulative effect. In conclusion, we propose an unbiased estimator for the CIE. Future studies will investigate missingness in mediators and outcomes.
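The bootstrap approach to the standard error can be sketched for the simplest case. The example below is not the preprint's cumulative estimator: it assumes a single time point and the classical product-of-coefficients indirect effect (slope of mediator on exposure times the mediator coefficient in the outcome model). All function names and the simulated data are hypothetical.

```python
import random
import statistics


def slope(x, y):
    """OLS slope of y on x (with intercept)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return sxy / sxx


def mediator_coef(x, m, y):
    """Coefficient of m in the OLS fit y ~ x + m (with intercept)."""
    mx, mm, my = statistics.fmean(x), statistics.fmean(m), statistics.fmean(y)
    xc = [v - mx for v in x]
    mc = [v - mm for v in m]
    yc = [v - my for v in y]
    sxx = sum(a * a for a in xc)
    smm = sum(b * b for b in mc)
    sxm = sum(a * b for a, b in zip(xc, mc))
    sxy = sum(a * c for a, c in zip(xc, yc))
    smy = sum(b * c for b, c in zip(mc, yc))
    det = sxx * smm - sxm ** 2
    return (sxx * smy - sxm * sxy) / det


def indirect_effect(x, m, y):
    """Product-of-coefficients indirect effect for one time point."""
    return slope(x, m) * mediator_coef(x, m, y)


def bootstrap_se(x, m, y, n_boot=500, seed=0):
    """Nonparametric bootstrap SE: resample subjects with replacement."""
    rng = random.Random(seed)
    n = len(x)
    estimates = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        estimates.append(
            indirect_effect([x[i] for i in idx], [m[i] for i in idx], [y[i] for i in idx])
        )
    return statistics.stdev(estimates)
```

Resampling whole subjects keeps each person's exposure, mediator, and outcome together, which is the property a longitudinal extension would also need to preserve across time points.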

4
Widespread genetic effect heterogeneity impacts bias and power in nonlinear Mendelian randomization

Wang, J.; Morrison, J.

2026-04-20 epidemiology 10.64898/2026.04.17.26351133 medRxiv
Top 0.1%
13.0%

Mendelian randomization (MR) uses genetic variants as instrumental variables to infer causal relationships between complex traits. Standard MR can be used to estimate an average causal effect at the population level, and typically assumes a linear exposure-outcome relationship. Recently, several methods for estimating nonlinear effects have been developed. However, many have been found to produce spurious empirical findings when subjected to negative control analyses. We propose that this poor performance may be attributable to heterogeneity in variant-exposure associations. We demonstrate that heterogeneous genetic effects on exposure lead to biased estimates, poor coverage, and inflated type I error in control function and stratification-based methods. In contrast, two-stage least squares (TSLS) methods are robust to such heterogeneity, but suffer from low precision and low power in some circumstances. We show that a statistical test for heterogeneity can be used to guide the choice of nonlinear MR methods. Using UK Biobank data, we reassess the causal effects of BMI, vitamin D, and alcohol consumption on blood pressure, lipid, C-reactive protein, and age (negative control). We find strong evidence of heterogeneity for all three exposures, and also recapitulate previous results that control function and stratification-based methods are prone to false positives. Finally, using nonparametric TSLS, we identify evidence of nonlinear causal effects of BMI on HDL cholesterol, triglycerides, and C-reactive protein; however, specific estimates of the shape of these relationships are imprecise. Altogether, our results suggest that common nonlinear MR methods are unreliable in the presence of realistic levels of heterogeneity, and that more methodological development is required before practically useful nonlinear MR is feasible.

5
Comparison of methods for assessing effects of risk factors on disease progression in Mendelian randomization under index event bias

Zhang, L.; Higgins, I. A.; Dai, Q.; Gkatzionis, A.; Quistrebert, J.; Bashir, N.; Dharmalingam, G.; Bhatnagar, P.; Gill, D.; Liu, Y.; Burgess, S.

2026-03-02 epidemiology 10.64898/2026.02.26.26347193 medRxiv
Top 0.1%
9.8%

Mendelian randomization has emerged as a transformative approach for inferring causal relationships between risk factors and disease outcomes. However, applying Mendelian randomization to disease progression - a critical step in validating pharmacological targets - is hampered by index event bias. This form of selection bias occurs because analyses of disease progression are necessarily restricted to individuals who have already experienced the disease event. Here, we present a comprehensive evaluation of statistical methods designed to mitigate index event bias, including inverse-probability weighting, Slope-Hunter, and multivariable methods. We compare the performance of these methods in simulations and applied examples. Inverse-probability weighting methods reduce bias, but require individual-level data and will only fully eliminate bias when the disease event model is correctly specified. Slope-Hunter performed poorly in all simulation scenarios, even when its assumptions were fully satisfied. Multivariable methods worked best when including genetic variants that affect the incident disease event. However, if these genetic variants also affect disease progression directly, then the analysis will suffer from pleiotropy. Hence, if the same biological mechanisms affect disease incidence and progression, then multivariable methods will have little utility. But in such a case, analyses of disease progression are less critical, as conclusions reached from analyses of disease incidence are likely to hold for disease progression. Our findings indicate that no single method is a universal solution to provide reliable results for the investigation of disease progression. Instead, we propose a strategic framework for method selection based on data availability and biological context.

6
Validity and Interpretation of Two-Sample Mendelian Randomization with Binary Traits

Wu, Z.; Wang, J.

2026-02-18 genetics 10.1101/2024.06.09.598150 medRxiv
Top 0.1%
8.5%

Background: Two-sample Mendelian randomization (MR) is widely applied to binary exposures and outcomes. Yet standard MR models rely on linear effect assumptions that are difficult to interpret for binary traits. Although liability-based interpretations have been suggested, it remains unclear whether conventional summary-data MR is formally justified in this setting or what causal parameter it identifies. Methods: We develop a liability-threshold framework in which binary traits arise from underlying continuous liabilities. We derive explicit relationships between genome-wide association study (GWAS) coefficients obtained from logistic or linear regression on binary traits and marginal genetic associations on the liability scale. Under small genetic effects, typical for complex traits, observed-scale GWAS coefficients are approximately proportional to liability-scale associations. Results: This proportionality implies that standard two-sample MR methods remain statistically coherent for binary traits. MR applied to binary exposures or outcomes estimates a scaled causal effect between underlying liabilities rather than an effect on the observed binary scale. The scaling factor depends primarily on trait prevalence and is directly computable. Simulations and UK Biobank analyses confirm that, after rescaling, MR using binary traits recovers liability-scale causal effects consistent with analyses based on continuous traits. Conclusions: We provide a formal statistical justification for summary-data MR with binary traits and clarify the causal parameter being estimated. These results support routine MR practice for binary exposures and outcomes while enabling coherent interpretation of effect sizes. Key Messages: (1) The interpretation of two-sample MR with binary exposures or outcomes is often unclear because GWAS analyses are performed on the observed binary scale. (2) Under a liability threshold framework with small genetic effects, GWAS coefficients from logistic or linear regression on binary traits are approximately proportional to genetic associations on an underlying continuous liability scale. (3) Consequently, conventional summary-data MR applied to binary or ordinal traits remains valid and estimates a scaled causal effect between liabilities, requiring no modification of existing methods.

7
A Most Powerful Test for Gene-Gene Interaction in the Presence of Main Effects

Romanescu, R.; Liu, M.

2026-02-02 genetics 10.64898/2026.01.30.702572 medRxiv
Top 0.1%
8.5%

We consider the problem of optimal testing for genetic interaction between two variants, allowing for possible main effects. Finding a most powerful test is important because it ends a series of attempts in the literature to construct ever more powerful tests for interaction at the variant pair level. Testing under a logistic regression model is known to be underpowered, partly because patterns of enrichment in the genotypes themselves are lost when regarding genotypes solely as predictors. Instead, we use the retrospective likelihood approach, which makes use of all the data by treating genotypes as outcomes alongside affection status. Using a parsimonious parameterization of penetrance based on the risk ratio, which links directly to the population prevalence and avoids having to estimate an intercept term, we construct an approximate uniformly most powerful unbiased test for interaction. This test is based on optimal testing theory and accounts for nuisance main effects without requiring their explicit estimation. The test statistic can be easily modified for optimal testing under other modes of genetic interaction, such as recessive x recessive or dominant x dominant. We demonstrate significant power gains compared to the odds-ratio-based PLINK test, in simulation studies. Finally, we apply the test to scan for interactions in IBD cases and controls from the UK Biobank. The top SNP pairs show enrichment for a pathway related to existing therapies for IBD.

8
Prenatal Alcohol Exposure and Mental Health Outcomes: A Two-Sample Mendelian Randomization Study of DNA Methylation Signatures

Luo, D.; Lussier, A. A.

2026-02-03 epidemiology 10.64898/2026.01.30.26345158 medRxiv
Top 0.1%
7.3%

Prenatal alcohol exposure (PAE) can lead to a range of deficits falling under the umbrella of Fetal Alcohol Spectrum Disorder (FASD), which includes a higher risk of adverse neurodevelopmental and mental health outcomes. Although the biological mechanisms underlying the link between PAE and mental health remain unclear, DNA methylation (DNAm), an epigenetic modification responsive to environmental exposures, may explain these relationships. Here, we applied a two-sample Mendelian randomization (MR) framework to assess whether DNAm loci previously associated with PAE or FASD are linked to 11 psychiatric outcomes. Using summary statistics from the Genetics of DNA Methylation Consortium (GoDMC) mQTL database and large-scale GWAS, we analyzed DNAm loci from two epigenome-wide association studies: one examining FASD by Lussier et al. (2018) and one examining PAE patterns by Sharp et al. (2018). A total of 106 associations (Lussier) and 28 associations (Sharp) reached nominal significance (p<0.05) and passed sensitivity tests, with several surviving multiple testing correction. Notably, schizophrenia and bipolar disorder had the highest number of associated loci across both studies. Functional analysis showed that DNAm loci were enriched in signaling pathways, embryonic development, and neuron differentiation. Regional enrichment analysis revealed that FASD-related loci were more likely to occur in enhancer and south shore regions, implicating distal regulatory elements. PAE patterns conferred heterogeneous effects on DNAm and mental health risk, underscoring the complexity of timing-specific epigenetic vulnerability. These findings offer novel insights into the potential mechanism of DNAm linking PAE to mental health, and demonstrate the utility of MR in epigenetic epidemiology.

9
Controlling for confounds in UK Biobank brain imaging data with small subsets of subjects

Radosavljevic, L.; Smith, S.; Nichols, T. E.

2026-03-03 epidemiology 10.64898/2026.03.02.26347455 medRxiv
Top 0.1%
7.0%

The UK Biobank (UKB) Brain Imaging cohort contains data from almost 100,000 subjects and has yielded invaluable understanding of the links between the brain and health outcomes and lifestyles. Much of the understanding of these links has come from exploring the association between Imaging Derived Phenotypes (IDPs) and other variables that are unrelated to brain imaging, so-called non-Imaging Derived Phenotypes (nIDPs). When performing analysis of this kind, it is very important to control for well-known confounding factors such as age, sex and socio-economic status, as well as confounds which are related to the imaging protocol itself. In previous work, we created a pipeline for constructing imaging confounds for use in statistical inference via a standard multivariate linear regression approach (Alfaro-Almagro et al. 2021). However, this approach is problematic when the number of confounds exceeds the number of subjects, and is severely underpowered when the number of subjects is not much larger than the number of confounds. In this work, we perform a simulation study to evaluate 13 modelling approaches to account for confounds when their number is similar to or exceeds the number of subjects. Based on the simulation results, we recommend a ridge regression based permutation test for low sample sizes (n ≤ 50), a version of de-sparsified LASSO for intermediate sample sizes (50 < n ≤ 500), and multivariate linear regression aided by Principal Component Analysis (PCA) for larger sample sizes (n > 500). We also demonstrate the use of our recommended methodology on a real data example of finding associations between Alzheimer's Disease (AD) and IDPs.

10
Bias in genome-wide association test statistics due to omitted interactions

Yelmen, B.; Güler, M. N.; Estonian Biobank Research Team; Kollo, T.; Möls, M.; Charpiat, G.; Jay, F.

2026-02-22 bioinformatics 10.1101/2025.11.21.689603 medRxiv
Top 0.1%
6.9%

Over the past two decades, genome-wide association studies (GWAS) have enabled the discovery of thousands of variants associated with many complex human traits. However, conventional GWAS are still widely performed with linear models under the assumption that the genetic effects are predominantly additive. In this work, we investigate the test statistic behavior when linear models are used to obtain significant genotype-phenotype associations without accounting for epistasis. We first algebraically derive the mean and variance shift in the null statistic due to the omitted interaction term, and define the boundary between conservative (i.e., deflated statistic tail) and anti-conservative (i.e., inflated statistic tail) regimes for the common GWAS significance threshold. We then perform phenotype simulation analyses using the Estonian Biobank genotypes and validate the mathematical model. We demonstrate that the anti-conservative regime is plausible under realistic parameter settings and that models omitting interaction terms can produce spurious significance. Our findings suggest caution when interpreting statistically significant signals reported in the literature based on linear models, especially for large-scale GWAS.

11
XMR: A cross-population Mendelian randomization method for causal inference using genome-wide summary statistics

Huang, X.; Chao, Z.; Wang, Z.; Hu, X.; Yang, C.

2026-03-10 genetic and genomic medicine 10.64898/2026.03.10.26348003 medRxiv
Top 0.1%
6.7%

Mendelian randomization (MR) is an important tool for inferring causal relationships between exposures (like lifestyle factors or biomarkers) and health outcomes using genome-wide association study (GWAS) summary data, yet the small sample sizes of non-European populations often result in insufficient instrumental variables (IVs) and unreliable causal effect estimates. In this paper, we consider causal inference in underrepresented populations to improve global health equity. We propose a statistical method for cross-population MR, XMR, to enhance causal inference in these target populations by using auxiliary GWAS summary statistics from global biobanks. By leveraging the shared genetic basis of exposure traits in the target and auxiliary populations, XMR increases the number of IVs while maintaining robust estimates via rigorous evaluation of IV validity and accounting for confounding factors. Through extensive simulations and real-data analyses, we demonstrate that XMR can achieve greater statistical power, better control of false positive rates and more replicable results compared to existing methods. Notably, XMR successfully identifies novel causal relationships in our studies of the East Asian (including Japanese and Taiwanese), Central/South Asian, and African populations. These findings reveal potential heterogeneity in causal patterns across populations, highlighting the importance of causal inference in underrepresented populations.

12
Deriving LD-adjusted GWAS summary statistics through linkage disequilibrium deconvolution

Nouira, A.; Favre Moiron, M.; Tournaire, M.; Verbanck, M.

2026-04-11 genetic and genomic medicine 10.64898/2026.04.10.26350574 medRxiv
Top 0.1%
6.3%

Genome-wide association studies (GWAS) have identified numerous genetic variants associated with complex traits. However, linkage disequilibrium (LD) confounds these associations, leading to false positives where non-causal variants appear associated because they are correlated with nearby causal variants. This is particularly the case in highly polygenic traits where the genome can be saturated in causal variants. To address this issue, we propose LDeconv, a method based on truncated singular value decomposition (SVD) that adjusts GWAS summary statistics without requiring individual-level genotype data. This approach accounts for LD structure, isolates causal variants in high-LD regions, and improves the reliability of effect size estimates. We assess its performance through simulations across various LD scenarios, conduct extensive sensitivity analyses, and apply it to real GWAS data from the UK Biobank. Our results demonstrate that LDeconv effectively reduces false discoveries while preserving true associations, offering a robust framework for post-GWAS analysis.

13
The prevalence of protein misfolding as a mechanism for hereditary deafness

Gogal, R. A.; Cox, G. M.; Kolbe, D. L.; Odell, A. M.; Ovel, C. E.; McCormick, K. I.; Hong, B.; Azaiez, H.; Casavant, T. L.; Smith, R. J. H.; Braun, T. A.; Schnieders, M. J.

2026-03-11 genetics 10.64898/2026.03.09.710547 medRxiv
Top 0.1%
6.3%

Hearing loss is the most common sensory deficit, impacting ~5% of the world's population. The Deafness Variation Database (DVD) is a public resource of deafness variants, containing over 380,000 missense variants across 224 genes, with 303,577 classified as a variant of uncertain significance (VUS). To address the challenge of evaluating each deafness-associated VUS, we evaluate a family of probabilistic frameworks to quantify the strength of computational evidence based on ACMG/AMP recommendations. First, CADD and REVEL are compared using Bayesian models parameterized using either a ClinVar 2019 dataset or labeled DVD variants. The REVEL model built using the DVD dataset demonstrates the best accuracy, sensitivity, and specificity. Incorporation of (in)tolerance to missense variation based on sorting each gene into three bins (tolerant, average, intolerant) shows that intolerant DVD genes are consistent with a higher prior probability of being pathogenic (25.7%) than average (10.7%) or tolerant (8.7%) genes. Finally, the impact of protein folding stability was incorporated using a 2D likelihood, which surpassed the simpler models while also offering a biophysical rationale for the disease mechanism. The protein folding-informed Bayesian model results in 28,866 prioritized VUSs reaching a posterior probability of pathogenicity above 98% with a false positive rate of only 0.14%. Overall, 54,752 missense variants are predicted to cause protein folding destabilization of greater than 1.0 kcal/mol, while 18,706 of the 28,886 prioritized VUS (65%) surpass this threshold. From these VUSs, we identify twelve probands where the patient's genetic diagnosis is upgraded to likely pathogenic/pathogenic. We highlight two variants that cause clear structural disruption, demonstrating the impact of biophysical characterization on variant evaluation.
Author Summary: We investigate the impacts of single amino acid changes on protein structure and folding in the context of hearing loss. Hearing loss is the most common impairment of the main senses, affecting nearly 5% of the world's population. About 45% of people with hearing loss receive a diagnosis after targeted genetic testing. Here, we integrate biophysical data that quantifies the effect of a change to protein sequence on protein folding in combination with genetic data to improve our ability to identify protein amino acid changes that are likely to impact hearing. Our work leads to 12 patients receiving an upgraded diagnosis with their variant disrupting protein stability. Although the method is applied to hearing loss, it can be used for interpreting protein sequence changes in other disease contexts.

14
Beyond single-slope Mendelian randomization: structural representation of genetic heterogeneity in joint effect space

Hao, H.; Chen, D.; Qian, C.; Zhou, X.; Huang, H.; Zuo, J.; Wang, G.; Peng, X.; Liu, H.-X.

2026-03-14 genetic and genomic medicine 10.64898/2026.03.12.26348288 medRxiv
Top 0.1%
6.3%

Causal effects in complex traits are typically represented by a single linear slope. While conventional Mendelian randomization (MR) provides efficient scalar estimates, projection-based summaries do not explicitly capture structural organisation in joint effect space under genetic heterogeneity. We introduce MR-UBRA (Mendelian randomization-Unified Bayesian Risk Architecture), a probabilistic framework that decomposes instrumental variants into genetic risk fragments (GRFs) and quantifies extreme deviations using tail-risk metrics defined on the standardised residual magnitude |e|. MR-UBRA preserves the classical MR estimand while offering a structurally resolved representation of genetic heterogeneity. Across stroke subtypes, AF→CES, smoking→lung cancer, and BMI→T2D, effect-space distributions exhibit reproducible asymmetry, amplitude stratification, and multi-modal structure. MR-UBRA resolves component-level organisation, separating tail-dominant contributions from the causal core while maintaining consistency with the classical MR estimand. Simulations and boundary regimes demonstrate adaptive model complexity: MR-UBRA selects K>1 when multi-component structure is present and collapses to K=1 under homogeneous conditions, avoiding spurious stratification. These results support viewing causal effects in complex traits as structured distributions in joint effect space, enhancing causal representation without altering the MR estimand. Graphical Abstract: Mendelian randomization (MR) typically represents causal effects with a single linear slope. Under genetic heterogeneity, instrumental effects in joint (βX, βY) space may exhibit multi-component structure and amplitude stratification that cannot be captured by a scalar summary.
MR-UBRA fits a standard error-weighted mixture model to decompose instruments into genetic risk fragments (GRFs), estimates GRF-specific effects using posterior-weighted soft-IVW, and quantifies extreme deviations through tail-risk metrics (VaR/CVaR). Across empirical analyses and boundary regimes, MR-UBRA adapts model complexity (K) to structural signal, collapsing to K=1 under homogeneous conditions.

15
Methodological Considerations in Sibling Analyses of Prenatal Acetaminophen

Ahlqvist, V. H.; Sjoqvist, H.; Gardner, R. M.; Lee, B. K.

2026-03-30 epidemiology 10.64898/2026.03.27.26349515 medRxiv
Top 0.1%
4.9%

Background: Sibling-matched designs control for shared familial confounding but remain vulnerable to non-shared confounders. Bi-directional sensitivity analyses, which stratify families by whether the older or younger sibling was exposed, are commonly used to assess carryover effects. We aimed to demonstrate how this methodological approach can introduce severe confounding by parity. Methods: We conducted simulations motivated by a recent epidemiological study. The true causal effect of a hypothetical exposure (prenatal acetaminophen) on neurodevelopmental outcomes was set to strictly null. To introduce parity-related confounding, baseline exposure and outcome probabilities were varied slightly by birth order. We compared conditional logistic regression effect estimates from total sibling models against bi-directional stratified models. Results: In the total simulated sibling cohort, models yielded the true null effect (odds ratio = 1.00) when adjusting for parity. However, the bi-directional analyses exhibited divergent artifactual signals. Because parity is perfectly collinear with exposure in these stratified subsets, it cannot be adjusted for. For example, when the older sibling was exposed, the odds ratio for autism spectrum disorder was 1.68; when the younger was exposed, the odds ratio was 0.60. Conclusions: Divergent estimates in bi-directional sibling analyses can be a predictable artifact of parity confounding rather than evidence of carryover effects or invalidating unmeasured bias. Overall sibling models adjusting for parity may remain robust despite divergent stratified sensitivity results.

16
Do Amyloid Trajectories Reach a Physiologic Ceiling? Evidence from Iterative Approximation and Simulation

Gantenberg, J. R.; La Joie, R.; Heston, M. B.; Ackley, S. F.

2026-04-21 epidemiology 10.64898/2026.04.14.26350359 medRxiv
Top 0.2%
4.3%

Qualitative models of Alzheimer's pathology often posit that amyloid accumulation follows a sigmoid curve, indicating that the rate of deposition wanes over time. Longitudinal PET data now allow us to investigate amyloid accumulation trajectories with greater detail and over longer follow-up periods. We combine inferences from simulated amyloid trajectories, empirical PET data from the Alzheimer's Disease Neuroimaging Initiative (ADNI), and the sampled iterative local approximation algorithm (SILA) to assess whether amyloid accumulation reaches a physiologic ceiling. We find that SILA reliably detects a ceiling, when present, across a range of simulated scenarios that impose a sigmoid shape. When fit to empirical data from ADNI, however, SILA does not appear to indicate the presence of a ceiling. Thus, we conclude that amyloid trajectories may not reach a physiologic ceiling during the stages of Alzheimer's disease typically observed while patients remain under follow-up in cohort studies. Fits using SILA indicate that illustrative models of biomarker cascades, while useful tools for conceptualizing and interrogating pathologic processes, may not represent the shapes of amyloid trajectories accurately. Summary for General Public: Amyloid, a protein implicated in Alzheimer's disease, is thought to reach a plateau in the brain, but methods that estimate how amyloid changes over time suggest it grows unabated. Gantenberg et al. use one such method and simulations to argue that amyloid does not reach a plateau during the typical course of Alzheimer's.

17
Systematic assessment of machine learning-based variant annotation methods for rare variant association testing

Aguirre, M.; Irudayanathan, F. J.; Crow, M.; Hejase, H. A.; Menon, V. K.; Pendergrass, R. K.; McCarthy, M. I.; Fletez-Brant, K.

2026-03-20 bioinformatics 10.64898/2026.03.18.712715 medRxiv
Top 0.2%
4.2%

Machine learning-based annotation methods are increasingly used to assess the pathogenicity of genetic variants, but their performance at prioritizing variants for gene-level association testing remains poorly characterized. Here, we systematically benchmark five annotation methods -- CADD v1.6, CADD v1.7, AlphaMissense, ESM-1b, and GPN-MSA -- using four primary gene-based tests and six annotation-level aggregation tests across 14 quantitative traits measured in up to 350,377 UK Biobank participants. Using a novel framework based on Wasserstein distances, we quantify how annotation choice affects test calibration and power. Tests using CADD annotations achieve the highest signal separation, while tests using AlphaMissense annotations exhibit systematically lower calibration. All combinations of methods produced significant results that were enriched (1.8-5.8-fold) for loss-of-function intolerant genes, though tests using GPN-MSA annotations displayed the highest such enrichment. Replication across symmetric phenotypes and loss-of-function burden tests was generally similar across methods. Our analysis provides practical guidance for annotation method selection in rare variant studies and establishes a distributional framework for calibration assessment.
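The abstract does not spell out how the Wasserstein-based framework is constructed, so the following is only a generic sketch of one way a distributional calibration measure could look, not the authors' method. It assumes that calibration is judged by comparing a test's p-values with the Uniform(0, 1) distribution expected under the null, and it approximates the 1-Wasserstein distance by matching sorted p-values to uniform midpoint quantiles. The function name is hypothetical.

```python
def wasserstein_to_uniform(pvalues):
    """Approximate 1-Wasserstein distance between the empirical p-value
    distribution and Uniform(0, 1), using midpoint quantiles (i + 0.5) / n."""
    n = len(pvalues)
    if n == 0:
        raise ValueError("at least one p-value is required")
    ordered = sorted(pvalues)
    # In one dimension, W1 is the average gap between matched quantiles.
    return sum(abs(p - (i + 0.5) / n) for i, p in enumerate(ordered)) / n


# Perfectly uniform p-values versus p-values skewed toward zero.
calibrated = [(i + 0.5) / 100 for i in range(100)]
inflated = [p ** 2 for p in calibrated]
print(wasserstein_to_uniform(calibrated), wasserstein_to_uniform(inflated))
```

A distance near zero suggests p-values that behave as expected under the null. Larger values flag departure from uniformity, which could reflect either miscalibration or genuine signal depending on the set of genes examined.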

18
Mapping the Dynamic Interplay of Mental Health and Weight Across Childhood: Data-Driven Explorations Using Causal Discovery

Larsen, T. E.; Lorca, M. H.; Ekstrom, C. T.; Vinding, R.; Bonnelykke, K.; Strandberg-Larsen, K.; Petersen, A. H.

2026-04-17 epidemiology 10.64898/2026.04.16.26350943 medRxiv
Top 0.2%
4.0%

Childhood weight development, especially overweight and obesity, has been associated with mental health, but their dynamic, causal relationships, and whether these differ by sex, remain unclear. We applied causal discovery to data from the Danish National Birth Cohort (n=67,593) spanning six periods from pregnancy to late adolescence and considering 67 variables related to child and parental weight, mental health, lifestyle, and socio-economic factors. We found no statistically significant difference between the causal graphs for boys and girls (P=0.079). The data-driven models found causal influence of childhood weight on subsequent weight status. Mental health pathways were exclusively within or across adjacent periods and centered on early adolescent stress. We examined the interplay between a subset of mental health variables, containing information on externalizing and internalizing problems, and weight, and found no direct causal pathway between the two processes. These findings suggest that observed links between weight and these mental health measures may be attributable to confounding. Our findings demonstrate the value of data-driven causal discovery in large cohort studies and how to test for differences in causal mechanisms across subgroups. Results are available in an interactive application, enabling future research to further explore the interplay between weight and mental health.

19
EA-PheWAS: Integrating Phenotype Embeddings with PheWAS for Enhanced Gene-Phenotype Discovery

Zheng, W.; Liu, T.; Xu, L.; Xie, Y.; Jing, Y.; Shao, H.; Zhao, H.

2026-04-22 genetics 10.64898/2026.04.21.720031 medRxiv
Top 0.2%
3.9%

Phenome-wide association studies (PheWAS) enable systematic exploration of relationships between genetic variants and clinical phenotypes derived from electronic health records (EHRs). Conventional regression-based PheWAS treats phenotypes separately and relies on binary phenotype representations, which limits statistical power for rare variants and rare phenotypes and reduces the ability to detect associations with phenotypes that are distributed across clinical codes. To address this limitation, we first developed EmbedPheScan, a phenotype embedding-based prioritization framework that summarizes the phenotypic profiles of rare loss-of-function variant carriers in a continuous embedding space. We then proposed EA-PheWAS by combining these embedding-derived signals with conventional regression-based PheWAS results using the aggregated Cauchy association test. Using the UK Biobank whole-exome sequencing and EHR data, we show that the proposed methods maintain appropriate false-positive control. We then performed genome-wide phenome scans across all genes and across biologically defined gene classes to evaluate EA-PheWAS relative to conventional PheWAS and EmbedPheScan, consistently finding that EA-PheWAS outperformed the other two methods. We illustrate the utility of EA-PheWAS focusing on four genes representing distinct scenarios, including strong-effect disease genes (PKD1, PKD2), genes with large numbers of rare LoF carriers (NF1), and genes with extremely sparse carrier counts (FBN1).
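The combination step named in this abstract, the aggregated Cauchy association test, has a simple closed form: each p-value is mapped to a standard Cauchy variate, the variates are averaged with weights, and the average is mapped back to a p-value. The sketch below shows that general formula only. Equal weights and the example inputs are assumptions for illustration, and the abstract does not say how EA-PheWAS weights its two inputs.

```python
import math


def acat(pvalues, weights=None):
    """Combine p-values with the aggregated Cauchy association test."""
    if not pvalues:
        raise ValueError("at least one p-value is required")
    if any(not (0.0 < p < 1.0) for p in pvalues):
        raise ValueError("p-values must lie strictly between 0 and 1")
    if weights is None:
        # Equal weights are an assumption made for this illustration.
        weights = [1.0] * len(pvalues)
    total = sum(weights)
    # Weighted average of Cauchy-transformed p-values.
    statistic = sum(
        w * math.tan((0.5 - p) * math.pi) for w, p in zip(weights, pvalues)
    ) / total
    # Upper tail of the standard Cauchy distribution gives the combined p-value.
    return 0.5 - math.atan(statistic) / math.pi


# Hypothetical inputs: one strong signal and one uninformative p-value.
print(acat([1e-6, 0.5]))
```

Because the Cauchy transform is heavy-tailed, the combined p-value is dominated by the smallest input, so a strong signal from either source is largely preserved.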

20
Beyond thresholds: a fully Bayesian framework for quantifying allele count evidence for variant pathogenicity

Konovalov, F. A.

2026-02-10 genetics 10.64898/2026.02.09.704882 medRxiv
Top 0.2%
3.8%

Allele count data from affected individuals and population controls are central to variant interpretation, yet their evidential meaning is often mediated by discrete thresholds and implicit assumptions. This work introduces a fully quantitative Bayesian framework for dominant rare disease genetics in which all allele count evidence is summarized by a single quantity, the Bayes factor, that evaluates the probability of observing the same data under two explicitly defined competing models. Rather than replacing individual ACMG/AMP pathogenicity criteria, the Bayes factor provides a unified measure that naturally incorporates evidence in both the pathogenic and benign directions. The framework accounts for variation in affected cohort size, penetrance, disease prevalence, and assay error rates, allowing these biologically and technically meaningful quantities to be specified directly instead of absorbed into fixed cutoffs. Application to a non-Finnish European population shows that the dependence of the Bayes factor on observed allele counts is strongly shaped by how the affected cohort is defined and by false positive rates in control datasets. Across representative scenarios, Bayes factor values are broadly compatible with established allele count criteria combinations expressed on odds-ratio scales under typical parameterizations, while remaining tunable beyond these defaults.
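To make the idea of a Bayes factor over allele counts concrete, here is a deliberately simplified toy sketch. It is not the framework described in the abstract: it ignores penetrance, prevalence, assay error and the control cohort, and it compares just two point hypotheses for the allele frequency among affected individuals. Every name and number below is hypothetical.

```python
import math


def log_bayes_factor(k_affected, n_affected, freq_if_pathogenic, freq_if_benign):
    """Log Bayes factor for observing k variant alleles among n affected
    chromosomes under two fixed allele frequencies (binomial likelihoods)."""
    if not 0 <= k_affected <= n_affected:
        raise ValueError("allele count must be between 0 and n_affected")
    # The binomial coefficient is identical under both models and cancels.
    carriers = k_affected * math.log(freq_if_pathogenic / freq_if_benign)
    non_carriers = (n_affected - k_affected) * (
        math.log1p(-freq_if_pathogenic) - math.log1p(-freq_if_benign)
    )
    return carriers + non_carriers


# Hypothetical cohort: 2000 affected chromosomes, background frequency 1e-5,
# and a frequency of 2e-3 among affected individuals if the variant is causal.
for k in (0, 1, 5):
    print(k, math.exp(log_bayes_factor(k, 2000, 2e-3, 1e-5)))
```

Even in this toy version the same quantity moves in both directions: seeing no alleles in a large affected cohort yields a Bayes factor below 1 (evidence toward benign), while several observations push it well above 1.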